Providing Fault-Tolerance in Unreliable Grid Systems Through Adaptive Checkpointing and Replication

نویسندگان

  • Maria Chtepen
  • Filip H. A. Claeys
  • Bart Dhoedt
  • Filip De Turck
  • Peter A. Vanrolleghem
  • Piet Demeester
چکیده

As grids typically consist of autonomously managed subsystems with strongly varying resources, fault-tolerance forms an important aspect of the scheduling process of applications. Two well-known techniques for providing fault-tolerance in grids are periodic task checkpointing and replication. Both techniques mitigate the amount of work lost due to changing system availability but can introduce significant run-time overhead. The latter largely depends on the length of checkpointing interval and the chosen number of replicas, respectively. This paper presents a dynamic scheduling algorithm that switches between periodic checkpointing and replication to exploit the advantages of both techniques and to reduce the overhead. Furthermore, several heuristics are discussed that perform on-line adaptive tuning of the checkpointing period based on historical information on resource behavior. Simulationbased comparison of the proposed combined algorithm versus traditional strategies based on checkpointing and replication only, suggests significant reduction of average task makespan for systems with varying load.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling is an important issue for large scale computational grid because of its unreliable nature of grid resources. Commonly exploited techniques to realize fault tolerance is periodic Checkpointing that periodically ...

متن کامل

Fault TOLERANCE IN GRID COMPUTING: STATE OF THE ART AND OPEN ISSUES

Fault tolerance is an important property for large scale computational grid systems, where geographically distributed nodes co-operate to execute a task. In order to achieve high level of reliability and availability, the grid infrastructure should be a foolproof fault tolerant. Since the failure of resources affects job execution fatally, fault tolerance service is essential to satisfy QOS req...

متن کامل

A Survey on Task Checkpointing and Replication based Fault Tolerance in Grid Computing

A grid is a distributed computational and storage environment often composed of heterogeneous autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken in to account. Commonly utilized techniques for providing fault tolerance in distributed...

متن کامل

Grid Computing and Fault Tolerance Approach

Grid computing is a means of allocating the computational power of a large number of computers to complex difficult computation or problem. Grid computing is a distributed computing paradigm that differs from traditional distributed computing in that it is aimed toward large scale systems that even span organizational boundaries. This paper proposes a method to achieve maximum fault tolerance i...

متن کامل

Providing Fault Tolerance in Grid Computing Systems

In grid computing, resources are used outside the boundary of organizations and it becomes increasingly difficult to guarantee that resources being used are not malicious. Also, resources may enter and leave the grid at any time. So, fault tolerance is a crucial issue in grid computing. Fault tolerance can enhance grid throughput, utilization, response time and more economic profits. All mechan...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007